Unstructured information integration through data-driven similarity discovery

Authors

  • Rema Ananthanarayanan
  • Sreeram Balakrishnan
  • Berthold Reinwald
  • Yuen Yee
Abstract

Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data without relying on the availability of schema information. To this end, we have used various existing tools from natural language processing, data mining and related areas in a novel manner. The tools are used at the 'preprocessing' stage to (a) characterise each set of unstructured information (or collection of text data), (b) identify the related sets of unstructured information and (c) relate these sets to various reference data sets. All these steps are based solely on the instance values of the data sets. Subsequently, the information compiled in the preprocessing stage may be used at query time to query the structured and text data. We also present the results of applying our techniques to data integration across multiple unstructured data sources containing customer comments for a service provider.
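The abstract does not name the specific tools or similarity measures used, so the following is only a minimal sketch of the three preprocessing steps under stated assumptions: each text collection is characterised by a bag-of-words term-count profile, related collections are found by cosine similarity between profiles, and a collection is related to a reference data set by checking which reference values occur in it. The function names (`profile`, `cosine`, `reference_coverage`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def profile(texts):
    """(a) Characterise a text collection by term counts, using instance values only."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def cosine(a, b):
    """(b) Cosine similarity between two term-count profiles (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reference_coverage(prof, reference_values):
    """(c) Fraction of single-word reference values that occur in the collection."""
    values = {v.lower() for v in reference_values}
    if not values:
        return 0.0
    return sum(1 for v in values if v in prof) / len(values)


# Hypothetical toy collections and reference data (assumptions for illustration).
call_notes = ["Customer reports broadband outage", "Broadband router replaced"]
emails = ["My broadband connection keeps dropping", "Router lights are off"]
surveys = ["Staff were friendly and polite"]
products = ["broadband", "router", "mobile"]

p_calls = profile(call_notes)
p_emails = profile(emails)
p_surveys = profile(surveys)

print("calls vs emails :", round(cosine(p_calls, p_emails), 3))
print("calls vs surveys:", round(cosine(p_calls, p_surveys), 3))
print("product coverage in calls:", round(reference_coverage(p_calls, products), 3))
```

In this toy setup the call notes and emails share vocabulary and so score as related, while the survey comments do not; no schema is consulted at any point. A real system would likely need weighting (for example TF-IDF), stop-word handling and multi-word reference values, none of which is covered here.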

Similar resources

Adaptive Information Analysis in Higher Education Institutes

Information integration plays an important role in academic environments since it provides a comprehensive view of education data and enables managers to analyze and evaluate the effectiveness of education processes. However, the problem with traditional information integration is the lack of personalization due to weak information resources or unavailability of analysis functionality. In this ...


Towards virtual knowledge broker services for semantic integration of life science literature and data sources.

Research in the life sciences requires ready access to primary data, derived information and relevant knowledge from a multitude of sources. Integration and interoperability of such resources are crucial for sharing content across research domains relevant to the life sciences. In this article we present a perspective review of data integration with emphasis on a semantics driven approach to da...


A Knowledge-Driven Geospatially Enabled Framework for Geological Big Data

Geologic survey procedures accumulate large volumes of structured and unstructured data. Fully exploiting the knowledge and information that are included in geological big data and improving the accessibility of large volumes of data are important endeavors. In this paper, which is based on the architecture of the geological survey information cloud-computing platform (GSICCP) and big-data-rela...


Models and Indices for Integrating Unstructured Data with a Relational Database

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. This integration requires extracting structured column values from the unstructured source and mapping them to known database entities. Existing meth...


Publication date: 2009